Keywords

Author

Andreas Blombach, Philipp Heinrich

Published

February 16, 2023

library(tidyverse)

Is significant significant?

While marking Giesbert’s master’s thesis, Erkentraud notices that Giesbert appears to use the word significant quite often. Because she doesn’t quite trust her subjective impression, she wants to check whether the word actually occurs significantly more often in the thesis than in comparable texts.

To do this, she compares the word frequencies in Giesbert’s master’s thesis with those in a corpus of scientific papers.

In Giesbert’s magnum opus, the word occurs 33 times. The thesis comprises a total of 13,974 words. The reference corpus contains exactly one million words, and significant occurs 907 times.
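As a quick sanity check, we can compare the two relative frequencies directly (using the counts just given):

33 / 13974  # relative frequency in the thesis: ~0.00236
907 / 1e6   # relative frequency in the reference corpus: ~0.00091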

Contingency table:

observed <- matrix(c(33, 13941, 907, 999093), ncol = 2)
colnames(observed) <- c("master's thesis", "corpus")
rownames(observed) <- c("\"significant\"", "other words")
(observed <- as.table(observed))
              master's thesis corpus
"significant"              33    907
other words             13941 999093

Now, our research hypothesis is that the relative frequency of significant is higher in the thesis than in the corpus. This would imply a one-sided hypothesis test, but sadly, this doesn’t quite work here.

Tests based on the \(\chi^2\) distribution will typically be one-tailed (since the \(\chi^2\) distribution is asymmetric) and test the null hypothesis that there is no difference between the frequency distributions in question. Thus the alternative hypothesis is that there is a difference, regardless of its direction:

\(H_0\): The relative frequency of significant is the same in both the thesis and the reference corpus.

\(H_1\): The relative frequency of significant in the thesis is different from that in the corpus.

# draw curve
curve(
  dchisq(x, df = 1),
  xlim = c(0, 6),
  lwd = 2,
  col = "darkblue",
  main = "Chi-squared distribution, df = 1",
  xlab = "Chi-squared",
  ylab = "Probability density"
)

# function to draw area under curve
auc <- function(from, to, length = 50, col = "skyblue") {
  co.x <- c(from, seq(from, to, length = length), to)
  co.y <- c(0, dchisq(seq(from, to, length = length), df = 1), 0)
  polygon(co.x, co.y, col = col)
}

# area which 5% of all values fall into
auc(qchisq(0.05, 1, lower.tail = FALSE), 6)

# add text
text(4.2,
     0.3,
     "5% of values fall into this area\n(which extends beyond the graph's edge)")
arrows(4.4, 0.15, 4.5, 0.05, code = 2, length = 0.1)

# redraw curve (looks nicer)
curve(
  dchisq(x, df = 1),
  xlim = c(0, 6),
  lwd = 2,
  col = "darkblue",
  add = TRUE
)

The actual \(\chi^2\) test:

chisq.test(observed, correct = FALSE) # set correct = TRUE to apply Yates' continuity correction

    Pearson's Chi-squared test

data:  observed
X-squared = 31.48, df = 1, p-value = 2.015e-08

Expected frequencies given the null hypothesis:

For each table cell: row sum * column sum / total sum

expected <- matrix(
  c(
    sum(observed[1,]) * sum(observed[,1]) / sum(observed),
    sum(observed[2,]) * sum(observed[,1]) / sum(observed),
    sum(observed[1,]) * sum(observed[,2]) / sum(observed),
    sum(observed[2,]) * sum(observed[,2]) / sum(observed)
  ),
  ncol = 2
)
colnames(expected) <- c("master's thesis", "corpus")
rownames(expected) <- c("\"significant\"", "other words")
(expected <- as.table(expected))
              master's thesis       corpus
"significant"        12.95453    927.04547
other words       13961.04547 999072.95453

The G-test (a likelihood ratio test) is a better alternative for corpus data (since the \(\chi^2\) test tends to overestimate the significance of rare events):

G <- 2 * sum(observed * log(observed / expected))
# If one cell count is 0, this will result in NaN ...
# Since the function's limit at 0 is 0, we can simply ignore such
# cells:
G <- 2 * sum(ifelse(observed > 0, observed * log(observed / expected), 0))

pchisq(G, df = 1, lower.tail = FALSE)
[1] 2.60305e-06

Effect sizes like phi and Cramér’s V (identical for 2x2 contingency tables) allow us to compare effects across different sample sizes:

phi: \(\sqrt{\frac{\chi^2}{n}}\)

Cramér’s V: \(\sqrt{\frac{\chi^2}{n \cdot df}}\)
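Applied to our contingency table, a minimal sketch (the test statistic can be taken straight from chisq.test(); for a 2x2 table, df = 1, so phi and Cramér’s V coincide):

chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)
(phi <- sqrt(chi2 / sum(observed)))  # a very small effect, despite the tiny p-value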

Keyword extraction: Comparing (sub)corpora

Which words are most characteristic for a given text or corpus?

To find out, we can compare each word’s frequency to its frequency in a reference corpus (or in another text or corpus, depending on our research interest).

blechtrommel-freq.csv contains the frequencies of all the token/part-of-speech combinations found in the famous novel “Die Blechtrommel” by Günter Grass. grammofon-freq.csv likewise contains frequencies from “Wie der Soldat das Grammofon repariert” by Saša Stanišić.

We will compare the word frequencies from these novels to those from a German reference corpus, the “DWDS-Kernkorpus”, described here (in German): https://www.dwds.de/d/korpora/kern.

The latter word frequencies were retrieved from http://kaskade.dwds.de/dstar/kern/lexdb/view.perl?select=*&from=lex&where=&groupby=&orderby=f+DESC&offset=0&limit=10&_s=submit.

dwds <- read_csv("../data/dwds-kernkorpus-freq.csv.gz")
blechtrommel <- read_csv2("../data/blechtrommel-freq.csv")
grammofon <- read_csv2("../data/grammofon-freq.csv")

Minor systematic tagging differences:

blechtrommel <- blechtrommel |>
  mutate(POS = ifelse(POS == "PROAV", "PAV", POS))
grammofon <- grammofon |>
  mutate(POS = ifelse(POS == "PROAV", "PAV", POS))

Frequency filter, no punctuation marks:

blechtrommel_filtered <- blechtrommel |>
  filter(n >= 5, !str_detect(Token, "^[[:punct:]]+$"))
grammofon_filtered <- grammofon |>
  filter(n >= 5, !str_detect(Token, "^[[:punct:]]+$"))

Text/corpus sizes:

dwds_size <- dwds$Count |> sum()
blechtrommel_size <- blechtrommel$n |> sum()
grammofon_size <- grammofon$n |> sum()

Join tibbles by Token and POS (left join since we’re mostly interested in how the tokens in the novels differ from those in the reference corpus, not vice versa):

blechtrommel_filtered <- blechtrommel_filtered |> left_join(dwds)
Joining, by = c("Token", "POS")
grammofon_filtered <- grammofon_filtered |> left_join(dwds)
Joining, by = c("Token", "POS")

There are a few token/part-of-speech combinations that only occur in the novels. Some of these are expected: even a very large corpus won’t contain every existing word, let alone words an author makes up on the fly.

blechtrommel_filtered |> filter(is.na(Count))
grammofon_filtered |> filter(is.na(Count))

Looking at the tokens in question, however, we can see that many of them are likely just tagged incorrectly (different taggers were used for the DWDS corpus and the novels). Moving forward, we could either ignore them altogether or set their frequency in the reference corpus to 0 (or to a very low value to avoid computational issues with certain association measures). The latter method is better for unusual words, but may vastly overestimate the keyness of incorrectly tagged tokens (so keep those in mind, especially als).

Another way, of course, would be to forgo using POS information in the first place.
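If we preferred to simply ignore the unmatched tokens, a filter on the missing counts would do (a sketch of that alternative; not used here). Below, we instead set the missing reference counts to 0:

# alternative (not pursued here): drop tokens without a counterpart in the reference corpus
blechtrommel_filtered |> filter(!is.na(Count))
grammofon_filtered |> filter(!is.na(Count))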

blechtrommel_filtered <- blechtrommel_filtered |>
  mutate(Count = replace_na(Count, 0))
grammofon_filtered <- grammofon_filtered |>
  mutate(Count = replace_na(Count, 0))

Let’s now compute the log-likelihood ratio for each frequency difference. First, we’ll need a couple of helper functions:

# expected frequencies for a 2x2 contingency table
# (for each cell: row sum * column sum / total sum)
exp2x2 <- function(observed) {
  return(matrix(
    c(
      sum(observed[1,]) * sum(observed[,1]) / sum(observed),
      sum(observed[2,]) * sum(observed[,1]) / sum(observed),
      sum(observed[1,]) * sum(observed[,2]) / sum(observed),
      sum(observed[2,]) * sum(observed[,2]) / sum(observed)
    ),
    ncol = 2
  ))
}

# log-likelihood ratio (G) for a word occurring freq1 and freq2 times
# in two corpora of corpus_size1 and corpus_size2 tokens, respectively
llr <- Vectorize(function(freq1, freq2, corpus_size1, corpus_size2) {
  observed <- matrix(c(freq1, corpus_size1 - freq1,
                       freq2, corpus_size2 - freq2),
                     ncol = 2)
  expected <- exp2x2(observed)
  return(2 * sum(ifelse(observed > 0, observed * log(observed / expected), 0)))
})

We can now apply this vectorised function to our tibbles. We can also compute p-values as seen above:

blechtrommel_filtered <- blechtrommel_filtered |>
  mutate(LLR = llr(n, Count, blechtrommel_size, dwds_size),
         p = pchisq(LLR, df = 1, lower.tail = FALSE))
grammofon_filtered <- grammofon_filtered |>
  mutate(LLR = llr(n, Count, grammofon_size, dwds_size),
         p = pchisq(LLR, df = 1, lower.tail = FALSE))

Let’s have a look at the tibbles, sorted by LLR. For each token, we’ll also add its frequency per million tokens in each corpus.

blechtrommel_filtered <- blechtrommel_filtered |>
  arrange(desc(LLR)) |>
  mutate(fpmt1 = n / blechtrommel_size * 1e6,
         fpmt2 = Count / dwds_size * 1e6)
grammofon_filtered <- grammofon_filtered |>
  arrange(desc(LLR)) |>
  mutate(fpmt1 = n / grammofon_size * 1e6,
         fpmt2 = Count / dwds_size * 1e6)

blechtrommel_filtered
grammofon_filtered

At the top of both tibbles, we see names of characters and places as well as some important content words. We also see some first-person pronouns, which is not that surprising – the DWDS corpus contains many texts from domains where these pronouns are rather uncommon (e.g. news and science articles).

Note that even small differences in relative frequencies can be highly significant, due to the large sample size.

Take, for example, the preposition in. In “Die Blechtrommel”, it appears 2,233 times, resulting in a relative frequency of 0.0093342. In the DWDS corpus, it occurs 1,600,339 times, resulting in a relative frequency of 0.0131826. That’s not exactly a huge difference, is it? But the log-likelihood ratio is approx. 303, resulting in a fabulously low p-value of 8.83e-68 …
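We can check these numbers with the llr() function defined above, plugging in the counts just quoted (the corpus sizes are the variables computed earlier):

LLR_in <- llr(2233, 1600339, blechtrommel_size, dwds_size)
LLR_in
pchisq(LLR_in, df = 1, lower.tail = FALSE)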

You might be tempted to compute the difference between relative frequencies, but the ratio is a better idea (why is that?). This ratio is also known as the relative risk (although this name doesn’t make much sense in our case).1

The odds ratio is another way of quantifying the effect size – although a little less intuitive, as we already know.

# odds ratio for a 2x2 table: (O11 / O21) / (O12 / O22)
odds_ratio <- Vectorize(function(freq1, freq2, corpus_size1, corpus_size2) {
  observed <- matrix(c(freq1, corpus_size1 - freq1,
                       freq2, corpus_size2 - freq2),
                     ncol = 2)
  return(
    (observed[1,1] / observed[2,1]) / (observed[1,2] / observed[2,2])
  )
})

blechtrommel_filtered <- blechtrommel_filtered |>
  mutate(rel_risk = fpmt1 / fpmt2,
         odds_ratio = odds_ratio(n, Count, blechtrommel_size, dwds_size))
grammofon_filtered <- grammofon_filtered |>
  mutate(rel_risk = fpmt1 / fpmt2,
         odds_ratio = odds_ratio(n, Count, grammofon_size, dwds_size))

blechtrommel_filtered |>
  select(Token, POS, LLR, fpmt1, fpmt2, rel_risk, odds_ratio,
         everything()) |>
  head(20)
grammofon_filtered |>
  select(Token, POS, LLR, fpmt1, fpmt2, rel_risk, odds_ratio,
         everything()) |>
  head(20)
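Since the rel_risk column is already in place, Hardie’s log ratio (see footnote 1) is simply its binary logarithm. A quick sketch for the first few rows (the log_ratio column name is our own):

blechtrommel_filtered |>
  mutate(log_ratio = log2(rel_risk), .after = rel_risk) |>
  select(Token, POS, LLR, rel_risk, log_ratio) |>
  head(10)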

The G-test is only one of a variety of association measures and effect sizes, each of which will identify a slightly different set of keyword candidates in a different order.

As an example, here’s the Dice coefficient (or Sørensen–Dice coefficient):

# Dice coefficient: 2 * O11 / (R1 + C1), with R1 = O11 + O12 (the word's total
# frequency across both corpora) and C1 = O11 + O21 (the size of the first
# corpus); this is why observed[1, 1] appears twice in the sum
dice <- Vectorize(function(freq1, freq2, corpus_size1, corpus_size2) {
  observed <- matrix(c(freq1, corpus_size1 - freq1,
                       freq2, corpus_size2 - freq2),
                     ncol = 2)
  return(2 * observed[1, 1] / sum(observed[1, 1], observed[1, 2],
                                  observed[1, 1], observed[2, 1]))
})
blechtrommel_filtered <- blechtrommel_filtered |>
  mutate(Dice = dice(n, Count, blechtrommel_size, dwds_size),
         .after = LLR)
grammofon_filtered <- grammofon_filtered |>
  mutate(Dice = dice(n, Count, grammofon_size, dwds_size),
         .after = LLR)

blechtrommel_filtered |> head(20)
grammofon_filtered |> head(20)

Finally, we can filter the data to get a clearer picture. In our case, let’s first throw away tokens whose p-value is too high (we’ll use an \(\alpha\) of .001, and to correct for multiple comparisons, we’ll use the Šidák correction). We can also use the rel_risk column to filter. Lastly, let’s only keep content words such as nouns, proper nouns, adjectives and full verbs.

alpha <- .001
blechtrommel_filtered |>
  filter(p < 1 - (1 - alpha)^(1/n())) |> # Šidák correction
  filter(rel_risk >= 2) |>
  filter(POS %in% c("NN", "NE", "ADJA", "ADJD") |
           str_detect(POS, "^VV"))
grammofon_filtered |>
  filter(p < 1 - (1 - alpha)^(1/n())) |>
  filter(rel_risk >= 2) |>
  filter(POS %in% c("NN", "NE", "ADJA", "ADJD") |
           str_detect(POS, "^VV"))

Conservative LogRatio (Evert 2022):

binom.confint <- function(k, n, conf.level = 0.95, correct = FALSE,
                          alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  stopifnot(all(k >= 0) && all(k <= n) && all(n >= 1))
  stopifnot(all(conf.level >= 0) && all(conf.level <= 1))

  ## significance level for underlying hypothesis test (with optional Bonferroni correction)
  alpha <- if (alternative == "two.sided") (1 - conf.level) / 2 else (1 - conf.level)
  if (correct) alpha <- alpha / length(k) # Bonferroni correction
  alpha <- rep_len(alpha, length(k))      # needs to be vector for safe.qbeta() 
  
  ## Clopper-Pearson method: invert binomial test (using incomplete Beta function)
  lower <- safe.qbeta(alpha, k, n - k + 1)
  upper <- safe.qbeta(alpha, k + 1, n - k, lower.tail=FALSE)
  switch(alternative,
         two.sided = data.frame(lower = lower, upper = upper),
         less      = data.frame(lower = 0,     upper = upper),
         greater   = data.frame(lower = lower, upper = 1))
}

## safely compute qbeta even for shape parameters alpha == 0 or beta == 0
safe.qbeta <- function (p, shape1, shape2, lower.tail = TRUE) {
  stopifnot(length(p) == length(shape1) && length(p) == length(shape2))
  is.0 <- shape1 <= 0
  is.1 <- shape2 <= 0
  ok <- !(is.0 | is.1)
  x <- numeric(length(p))
  x[ok] <- qbeta(p[ok], shape1[ok], shape2[ok], lower.tail = lower.tail) # shape parameters are valid
  x[is.0 & !is.1] <- 0 # density concentrated at x = 0 (for alpha == 0)
  x[is.1 & !is.0] <- 1 # density concentrated at x = 1 (for beta == 0)
  x[is.0 & is.1] <- NA # shouldn't happen in our case (alpha == beta == 0)
  x
}

LRC <- function (f1, f2, N1, N2, conf.level = .95, correct = TRUE) {
  stopifnot(length(f1) == length(f2))
  stopifnot(all(f1 + f2 >= 1))

  ## exact confidence interval from conditional Poisson test (two-sided)
  tau <- binom.confint(f1, f1 + f2, conf.level = conf.level, correct = correct, alternative = "two.sided")
  ifelse(f1 / N1 >= f2 / N2, 
         pmax(log2( (N2 / N1) * tau$lower / (1 - tau$lower) ), 0),  # p1 >= p2 -> use lower bound (clamped to >= 0)
         pmin(log2( (N2 / N1) * tau$upper / (1 - tau$upper) ), 0))  # p1 < p2  -> use upper bound (clamped to <= 0)
}
blechtrommel_filtered <- blechtrommel_filtered |>
  mutate(LRC = LRC(n, Count, blechtrommel_size, dwds_size),
         .after = LLR)
grammofon_filtered <- grammofon_filtered |>
  mutate(LRC = LRC(n, Count, grammofon_size, dwds_size),
         .after = LLR)

blechtrommel_filtered |>
  arrange(desc(LRC)) |>
  filter(POS %in% c("NN", "NE", "ADJA", "ADJD") |
           str_detect(POS, "^VV"))
grammofon_filtered |>
  arrange(desc(LRC)) |>
  filter(POS %in% c("NN", "NE", "ADJA", "ADJD") |
           str_detect(POS, "^VV"))

Footnotes

  1. Andrew Hardie likes to take the binary logarithm of this ratio; he then calls this log ratio: http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/